## Acquire demographic data on tennis players
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
#library(ggradar)
Note: The player demographics data contains 56,602 players.
So, the US, Spain, Australia, Germany, and Italy have historically had the most male players in professional tennis.
Let me explain in plain English. In each game (first to 4 points, but you have to win by 2), one player does all the serving. Then the other player serves the next game. Thus, we have:
\[\text{Proportion of Service Games won} = \frac{\text{# of service games the player won}}{\text{# of games the player served}} \]
Likewise, we have:
\[\text{Proportion of Opponent Service Games broken} = \frac{\text{# of non service games the player won}}{\text{# of opponent service games}} \]
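As a minimal sketch of the two proportions above, here is how they could be computed from per-player counts. The data frame, its column names (`service_games_won`, `service_games_played`, `return_games_won`, `return_games_played`), and all of the numbers are made up for illustration and are not taken from the actual data.

```r
library(dplyr)

# Toy counts for two made-up players (hypothetical columns and values).
games <- data.frame(
  player               = c("Player A", "Player B"),
  service_games_won    = c(80, 60),
  service_games_played = c(100, 90),
  return_games_won     = c(25, 18),
  return_games_played  = c(100, 90)
)

service_summary <- games %>%
  mutate(
    # Share of the player's own service games that they won.
    prop_service_games_won = service_games_won / service_games_played,
    # Share of the opponent's service games that the player won (breaks).
    prop_opp_service_games_broken = return_games_won / return_games_played
  )

service_summary
```

Here `return_games_played` stands in for the number of opponent service games, since every game the player does not serve is served by the opponent.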
Plot probability of winning on the serve for the Big 3 (Nadal, Djokovic, and Federer) by season.
To calculate the probability of winning on the serve, use the following formula:
\[\text{P(Winning on Serve)} = \frac{\text{# of times winning on 1st serve + # of times winning on 2nd serve}}{\text{# of service points played}} \]
P(Winning on Serve) is often computed at the game level, but it can be extended to the season level by summing these counts for a player over the given year, which is what I do here.
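The season-level version can be sketched roughly as follows. This is an illustration under assumptions, not the code used for the plot: the columns `first_won`, `second_won`, and `service_points` are hypothetical names, the values are invented, and I assume the denominator counts service points played.

```r
library(dplyr)

# Toy per-row serve counts (hypothetical columns, illustrative values only).
matches <- data.frame(
  player         = c("Player A", "Player A", "Player A", "Player B"),
  season         = c(2010, 2010, 2011, 2010),
  first_won      = c(30, 40, 35, 28),  # points won on the first serve
  second_won     = c(12, 15, 10, 9),   # points won on the second serve
  service_points = c(60, 80, 70, 65)   # service points played
)

serve_by_season <- matches %>%
  group_by(player, season) %>%
  summarise(
    # Sum the counts first, then divide, rather than averaging per-row ratios.
    p_win_on_serve = (sum(first_won) + sum(second_won)) / sum(service_points),
    .groups = "drop"
  )

serve_by_season
```

Summing before dividing weights each row by how many points were actually served, so a short match does not count as much as a long one.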
Most of these are quite low, as expected, except for carpet. However, it turns out each player had only a very small number of games on carpet, which would explain why the Ace Service rate is so high for carpet.
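One way to make the small-sample issue visible is to report the number of observations next to each rate. The sketch below is only illustrative: the columns `surface`, `aces`, and `service_points` are hypothetical, the numbers are invented, and I assume the rate is aces divided by service points, which may not match the definition used for the plot.

```r
library(dplyr)

# Toy rows for one made-up player (hypothetical columns and values).
serve_stats <- data.frame(
  surface        = c("Hard", "Hard", "Clay", "Clay", "Carpet"),
  aces           = c(8, 12, 3, 5, 9),
  service_points = c(80, 100, 70, 90, 40)
)

ace_by_surface <- serve_stats %>%
  group_by(surface) %>%
  summarise(
    n_obs    = n(),                             # rows behind each rate
    ace_rate = sum(aces) / sum(service_points), # assumed definition of the rate
    .groups  = "drop"
  ) %>%
  arrange(n_obs)                                # smallest samples first

ace_by_surface
```

In this toy data the carpet rate rests on a single row, so it is much noisier than the hard or clay rates.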
Check how many of each there are.
##   right left     U missing
## 1 15444 1395 39519     243
The demographics list contains every player who has ever played on the ATP tour, which is why there are 56,602 players in that file and why some of the oldest players have dates of birth around 1913. Given that U occurs mainly among the older players, I think U means the handedness of that player is “Unknown”, but I’m not sure. Some values are actually missing, and those end up in a violin plot with no label. Overall, I will stick to using more recent data, from roughly the 1990s onward.
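A quick way to check whether U really is concentrated among older players is to cross-tabulate handedness against birth era. The following is a sketch on invented data: the columns `hand` and `birth_year` are hypothetical names, and the 1950 cutoff is an arbitrary choice made only for illustration.

```r
library(dplyr)

# Toy player table (hypothetical columns, illustrative values only).
players <- data.frame(
  hand       = c("U", "U", "right", "left", NA, "right"),
  birth_year = c(1913, 1925, 1980, 1985, 1990, 1960)
)

hand_by_era <- players %>%
  mutate(
    # Give missing handedness an explicit label so it is not silently dropped.
    hand = if_else(is.na(hand), "missing", hand),
    # Arbitrary cutoff, used only to separate older from newer players.
    era  = if_else(birth_year < 1950, "born before 1950", "born 1950 or later")
  ) %>%
  count(era, hand)

hand_by_era
```

Labeling the missing values explicitly also avoids an unlabeled category showing up in later plots.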